The Problem with Kappa
Abstract
It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). Techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve and variants of Kappa, have been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
Introduction
Research in Computational Linguistics usually requires some form of quantitative evaluation. A number of traditional measures borrowed from Information Retrieval (Manning & Schütze, 1999) are in common use, but there has been considerable critical evaluation of these measures themselves over the last decade or so (Entwisle & Powers, 1998; Flach, 2003; Ben-David, 2008). Receiver Operating Characteristics (ROC) analysis has been advocated as an alternative by many, and in particular has been used by Fürnkranz and Flach (2005), Ben-David (2008) and Powers (2008) to better understand both learning algorithms and the relationship between the various measures, and the inherent biases that make many of them suspect.
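The measures named in the abstract can all be computed from a single 2x2 contingency table, which makes the effect of skew easy to explore. The following is a minimal sketch, not the paper's own code. It assumes that Fleiss Kappa is taken in its two-rater form (chance agreement from the averaged marginals) and that Powers Kappa is Informedness (Recall + Inverse Recall - 1); neither definition is spelled out in this excerpt. The helper names `table_from_rates` and `measures` are hypothetical.

```python
from math import sqrt


def table_from_rates(tpr, fpr, prevalence):
    """Build a 2x2 table (as proportions) for a classifier with fixed
    true/false positive rates deployed at a given prevalence."""
    tp = tpr * prevalence
    fn = (1 - tpr) * prevalence
    fp = fpr * (1 - prevalence)
    tn = (1 - fpr) * (1 - prevalence)
    return tp, fp, fn, tn


def _ratio(num, den):
    # Guard against degenerate tables with an empty row or column.
    return num / den if den else 0.0


def measures(tp, fp, fn, tn):
    """Chance-corrected measures from counts or proportions."""
    n = tp + fp + fn + tn
    tp, fp, fn, tn = tp / n, fp / n, fn / n, tn / n
    rp, rn = tp + fn, fp + tn  # real marginals (rp is Prevalence)
    pp, pn = tp + fp, fn + tn  # predicted marginals (pp is Bias)
    acc = tp + tn
    # Cohen: chance agreement from the product of the two marginals.
    e_cohen = rp * pp + rn * pn
    # Fleiss (two-rater form, assumed): chance from averaged marginals.
    e_fleiss = ((rp + pp) / 2) ** 2 + ((rn + pn) / 2) ** 2
    return {
        "accuracy": acc,
        "cohen_kappa": _ratio(acc - e_cohen, 1 - e_cohen),
        "fleiss_kappa": _ratio(acc - e_fleiss, 1 - e_fleiss),
        # Informedness (assumed to be Powers Kappa): tpr - fpr.
        "informedness": _ratio(tp, rp) + _ratio(tn, rn) - 1,
        "mcc": _ratio(tp * tn - fp * fn, sqrt(rp * rn * pp * pn)),
    }


# Same hypothetical classifier (tpr=0.8, fpr=0.3) under opposite skews.
high = measures(*table_from_rates(0.8, 0.3, prevalence=0.9))
low = measures(*table_from_rates(0.8, 0.3, prevalence=0.1))
for name in high:
    print(f"{name:14s} prev=0.9: {high[name]:.4f}   prev=0.1: {low[name]:.4f}")
```

In this sketch the Informedness value is the same at both prevalences because it depends only on the two rates, whereas the other measures move with the marginals. The rates and prevalences are illustrative values, not data from the paper.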
One of the key advantages of ROC is that it provides a clear indication of chance-level performance, as well as a less well-known indication of the relative cost weighting of positive and negative cases for each possible system or parameterization represented. ROC Area Under the Curve (Fig. 1) has also been used as a performance measure, but it averages over the false positive rate (Fallout) and is thus a function of cost that is dependent on the classifier rather than the application. For this reason it has come in for considerable criticism, and a number of variants and alternatives have been proposed (e.g. AUK, Kaymak et al., 2010 and H-measure, Hand, 2009). A ROC curve that is at least as good as a second curve at all points is said to dominate it, which indicates that the first classifier is equal to or better than the second for all plotted values of the parameters and all cost ratios. However, AUC being greater for one classifier than another does not have such a property; indeed deconvexities within or
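The distinction drawn above between dominance and a larger AUC can be illustrated with a small sketch. It assumes each ROC curve is given as a list of (false positive rate, true positive rate) points joined by straight lines, computes the area with the trapezoid rule, and checks dominance at the union of the two curves' breakpoints. The function names and the two example curves are hypothetical, not taken from the paper.

```python
def auc(points):
    """Trapezoid-rule area under a piecewise-linear ROC curve."""
    pts = sorted(points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


def tpr_at(points, fpr):
    """Linearly interpolate the true positive rate at a given fpr."""
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:  # vertical segment: take the upper end
                return max(y0, y1)
            return y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
    raise ValueError("fpr lies outside the curve's range")


def dominates(a, b):
    """True if curve a is at least as good as curve b at every point.
    For piecewise-linear curves it suffices to compare at all breakpoints."""
    xs = sorted({x for x, _ in a} | {x for x, _ in b})
    return all(tpr_at(a, x) >= tpr_at(b, x) for x in xs)


# Two illustrative curves: A has the larger area, but B is better
# at low false positive rates, so neither dominates the other.
curve_a = [(0.0, 0.0), (0.1, 0.3), (0.5, 0.95), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.1, 0.5), (0.5, 0.7), (1.0, 1.0)]

print("AUC A:", auc(curve_a), "AUC B:", auc(curve_b))
print("A dominates B:", dominates(curve_a, curve_b))
print("B dominates A:", dominates(curve_b, curve_a))
```

Here curve A has the greater area yet does not dominate curve B, so which classifier is preferable still depends on the operating point, and hence on the cost ratio, of the application.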
Publication date: 2012